{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BiDAF Model Deep Dive on AzureML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook demonstrates a deep dive into a popular question-answering (QA) model, Bi-Directional Attention Flow (BiDAF). We use [AllenNLP](https://allennlp.org/), an open-source NLP research library built on top of PyTorch, to train the BiDAF model from scratch on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset, using Azure Machine Learning ([AzureML](https://azure.microsoft.com/en-us/services/machine-learning-service/)). \n",
    "\n",
    "The following capabilities are highlighted in this notebook:  \n",
    "- AmlCompute\n",
    "- Datastore\n",
    "- Logging\n",
    "- AllenNLP library"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. [Introduction](#1.-Introduction)  \n",
    "    * 1.1 [SQuAD Dataset](#1.1-SQuAD-Dataset)  \n",
    "    * 1.2 [BiDAF Model](#1.2-BiDAF-Model)  \n",
    "    * 1.3 [AllenNLP](#1.3-AllenNLP)  \n",
    "2. [AzureML Setup](#2.-AzureML-Setup)  \n",
    "    * 2.1 [Link to or create a `Workspace`](#2.1-Link-to-or-create-a-Workspace)  \n",
    "    * 2.2 [Set up an `Experiment` and Logging](#2.2-Set-up-an-Experiment-and-Logging)  \n",
    "    * 2.3 [Link `AmlCompute` compute target](#2.3-Link-AmlCompute-Compute-Target)  \n",
    "    * 2.4 [Upload Files to `Datastore`](#2.4-Upload-Files-to-Datastore)  \n",
    "3. [Prepare Training Script](#3.-Prepare-Training-Script) \n",
    "4. [Create a PyTorch Estimator](#4.-Create-a-PyTorch-Estimator)\n",
    "5. [Submit a Job](#5.-Submit-a-Job)  \n",
    "6. [Inspect Results of Run](#6.-Inspect-Results-of-Run)  \n",
    "    * 6.1 [Evaluation on SQuAD](#6.1-Evaluation-on-SQuAD)\n",
    "    * 6.2 [Try the Best Model](#6.2-Try-the-Best-Model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 SQuAD Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset was released in 2016 and has become a benchmarking dataset for machine comprehension tasks. It contains a set of more than 100,000 question-context tuples along with their answers, extracted from Wikipedia articles. 90,000 of the question-context tuples make up the training set and the remaining 10,000 compose the development set. The answers are spans in the context (given passage) and are evaluated against human-labeled answers. Two metrics are used for evaluation: Exact Match (EM) and F1 score."
   ]
  },
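  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the two metrics concrete, the following is a minimal sketch of how EM and F1 can be computed for a single prediction. It assumes the usual SQuAD v1.1 convention of normalizing answers (lowercasing, stripping punctuation and the articles a/an/the) before comparing them. The function names `normalize_answer`, `exact_match`, `f1_score`, and `best_over_ground_truths` are illustrative, and the official evaluation script remains the reference implementation.\n",
    "\n",
    "```python\n",
    "import re\n",
    "import string\n",
    "from collections import Counter\n",
    "\n",
    "PUNCTUATION = set(string.punctuation)\n",
    "\n",
    "\n",
    "def normalize_answer(text):\n",
    "    # Lowercase, drop punctuation and articles, and collapse whitespace.\n",
    "    text = ''.join(ch for ch in text.lower() if ch not in PUNCTUATION)\n",
    "    text = re.sub(r'\\b(a|an|the)\\b', ' ', text)\n",
    "    return ' '.join(text.split())\n",
    "\n",
    "\n",
    "def exact_match(prediction, ground_truth):\n",
    "    # 1.0 if the normalized strings are identical, otherwise 0.0.\n",
    "    return float(normalize_answer(prediction) == normalize_answer(ground_truth))\n",
    "\n",
    "\n",
    "def f1_score(prediction, ground_truth):\n",
    "    # Token-level F1 between the normalized prediction and ground truth.\n",
    "    pred_tokens = normalize_answer(prediction).split()\n",
    "    truth_tokens = normalize_answer(ground_truth).split()\n",
    "    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())\n",
    "    if overlap == 0:\n",
    "        return 0.0\n",
    "    precision = overlap / len(pred_tokens)\n",
    "    recall = overlap / len(truth_tokens)\n",
    "    return 2 * precision * recall / (precision + recall)\n",
    "\n",
    "\n",
    "def best_over_ground_truths(metric_fn, prediction, ground_truths):\n",
    "    # A question can have several reference answers; keep the best score.\n",
    "    return max(metric_fn(prediction, truth) for truth in ground_truths)\n",
    "\n",
    "\n",
    "print(exact_match('the Denver Broncos', 'Denver Broncos'))\n",
    "print(f1_score('Broncos', 'Denver Broncos'))\n",
    "```\n",
    "\n",
    "The first call counts as an exact match because the article is removed during normalization. The second is only a partial match: precision is 1 and recall is 1/2, giving an F1 of 2/3."
   ]
  },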
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 BiDAF Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02\n",
    ") model achieved state-of-the-art performance on the SQuAD dataset in 2017 and is a well-respected, performant baseline for QA. The BiDAF network is a \"hierarchical multi-stage architecture for modeling representations of the context at different levels of granularity. BiDAF includes character-level, word-level, and phrase-level embeddings, and uses bi-directional attention flow to allow for query-aware context representations\". \n",
    "\n",
    "The network contains six different layers, as described by [Seo et al, 2017](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02):\n",
    "\n",
    "1. **Character Embedding Layer**: character-level CNNs to embed each word\n",
    "2. **Word Embedding Layer**: word embeddings using pre-trained GloVe word vectors\n",
    "3. **Phrase Embedding Layer**: LSTM on top of the previous layers to model the temporal interactions between words\n",
    "4. **Attention Flow Layer**: Fuses information from the context and query words. Unlike previous models, \"the attention flow layer is not used to summarize the query and context into a single feature vectors. Instead, the attention vectors at each time step, along with embeddings from previous layers, are allowed to flow through to the subsequent modeling layers\", reducing information loss.\n",
    "5. **Modeling Layer**: produces a matrix of contextual information about the word with respect to the entire context paragraph and query\n",
    "6. **Output Layer**: predicts the start and end indices of the phrase in the paragraph\n",
    "\n",
    "The following figure displays the architecture of the BiDAF network."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://nlpbp.blob.core.windows.net/images/BiDAF_model.png)"
   ]
  },
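  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the attention flow layer (layer 4), the sketch below computes context-to-query and query-to-context attention in plain Python. It is a simplification under stated assumptions: similarity is a plain dot product rather than the learned trilinear function used in the paper, the inputs are toy lists of floats, and the names `softmax`, `weighted_sum`, and `attention_flow` are illustrative rather than part of AllenNLP.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def softmax(scores):\n",
    "    # Numerically stable softmax over a list of floats.\n",
    "    peak = max(scores)\n",
    "    exps = [math.exp(s - peak) for s in scores]\n",
    "    total = sum(exps)\n",
    "    return [e / total for e in exps]\n",
    "\n",
    "\n",
    "def weighted_sum(weights, vectors):\n",
    "    # Sum of the vectors, each scaled by its weight.\n",
    "    dim = len(vectors[0])\n",
    "    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]\n",
    "\n",
    "\n",
    "def attention_flow(context, query):\n",
    "    # context: T vectors of size d; query: J vectors of size d.\n",
    "    # Assumption: dot-product similarity (the paper learns a trilinear function).\n",
    "    sim = [[sum(h_k * u_k for h_k, u_k in zip(h, u)) for u in query] for h in context]\n",
    "\n",
    "    # Context-to-query: for each context word, attend over the query words.\n",
    "    c2q = [weighted_sum(softmax(row), query) for row in sim]\n",
    "\n",
    "    # Query-to-context: weight context words by their best match to any query\n",
    "    # word, then reuse the single attended vector at every context position.\n",
    "    q2c_vector = weighted_sum(softmax([max(row) for row in sim]), context)\n",
    "\n",
    "    # Query-aware representation per context word: [h; u~; h*u~; h*h~], size 4d.\n",
    "    output = []\n",
    "    for h, u_att in zip(context, c2q):\n",
    "        output.append(\n",
    "            h\n",
    "            + u_att\n",
    "            + [a * b for a, b in zip(h, u_att)]\n",
    "            + [a * b for a, b in zip(h, q2c_vector)]\n",
    "        )\n",
    "    return output\n",
    "\n",
    "\n",
    "context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 words, d = 2\n",
    "query = [[1.0, 0.0], [0.0, 2.0]]  # J = 2 words\n",
    "g = attention_flow(context, query)\n",
    "print(len(g), len(g[0]))\n",
    "```\n",
    "\n",
    "Each of the 3 context words ends up with a vector of size 8 (that is, 4d) that still carries its original embedding alongside the attended query information. This is the sense in which attention flows on to the modeling layer instead of being summarized early."
   ]
  },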
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 AllenNLP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook demonstrates how to use the BiDAF implementation provided by [AllenNLP](https://www.semanticscholar.org/paper/A-Deep-Semantic-Natural-Language-Processing-Gardner-Grus/a5502187140cdd98d76ae711973dbcdaf1fef46d), an open-source NLP research library built on top of PyTorch. AllenNLP is a product of the Allen Institute for Artifical Intelligence and is used widely across differnet universities and top companies (including Facebook Research and Amazon Alexa). They maintain a robust and active [Github repository](https://github.com/allenai/allennlp) as well as a [website](https://allennlp.org/) with documentation and demos. Their model is a reimplementation of the original BiDAF model and they report a higher EM score and faster training times than the original BiDAF system (68.3 EM score versus 67.7 and 10x speedup, taking ~4 hours on a p2.xlarge). The AllenNLP library is mainly designed for use through the command line (and most tutorials use this method), but can also be used programatically. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The AllenNLP library focuses on the creation of NLP pipelines with easily interchangable building blocks. The general pipeline steps are as follows:  \n",
    "- DatasetReader: defines how to extract information from your data and convert it into Instance objects that will be used by the model  \n",
    "- Iterator: takes the instances produced by the DatasetReader and batches them for training\n",
    "- Model\n",
    "- Trainer: trains the model and records metrics  \n",
    "- Predictor: takes raw strings and produces predictions\n",
    "\n",
    "Each step is loosely-coupled, making it easy to swap different options for each step. While it is possible to construct your own AllenNLP objects (see this [tutorial](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/) for a great deep-dive into constructing your own AllenNLP pipeline), the easiest way is to utilize the JSON-like parameter constructor methods provided by most AllenNLP objects. For example, rather than\n",
    "\n",
    "```\n",
    "lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))\n",
    "```\n",
    "we can use  \n",
    "\n",
    "```\n",
    "lstm_params = Params({\n",
    "    \"type\": \"lstm\",\n",
    "    \"input_size\": EMBEDDING_DIM,\n",
    "    \"hidden_size\": HIDDEN_DIM\n",
    "})\n",
    "\n",
    "lstm = Seq2SeqEncoder.from_params(lstm_params)\n",
    "```\n",
    "This provides two advantages:  \n",
    "1. Experiments can be declaratively specified in a separate [configuration file](https://github.com/allenai/allennlp/blob/master/tutorials/tagger/README.md#using-config-files)  \n",
    "2. Experiments can be easily changed with no coding, rather just changing the entry in the config file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**AllenNLP Resources:**\n",
    "\n",
    "The following resources are recommended for understanding how the AllenNLP library works and being able to implement your own models and pipelines\n",
    "\n",
    "- Information about the provided AllenNLP models: https://allennlp.org/models\n",
    "- Using configuration files: https://github.com/allenai/allennlp/blob/master/tutorials/tagger/README.md#using-config-files   \n",
    "- In-depth discussion of each AllenNLP object used and how to construct your own specialized ones: https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/  \n",
    "- AllenNLPs Part-of-Speech-Tagging tutorial showcasing how to use their methods programatically: https://allennlp.org/tutorials   \n",
    "- Short AllenNLP programatic tutorial: https://github.com/titipata/allennlp-tutorial/blob/master/allennlp_tutorial.ipynb  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
      "[GCC 7.3.0]\n",
      "Azure ML SDK Version: 1.0.48\n"
     ]
    }
   ],
   "source": [
    "# Imports\n",
    "import sys\n",
    "import os\n",
    "import shutil\n",
    "sys.path.append(\"../../\")\n",
    "import json\n",
    "from urllib.request import urlretrieve\n",
    "import scrapbook as sb\n",
    "\n",
    "#import utils\n",
    "from utils_nlp.common.timer import Timer\n",
    "from utils_nlp.azureml import azureml_utils\n",
    "\n",
    "import azureml as aml\n",
    "from azureml.core import Datastore, Experiment\n",
    "from azureml.core.compute import ComputeTarget, AmlCompute\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "from azureml.train.dnn import PyTorch\n",
    "from azureml.widgets import RunDetails\n",
    "from azureml.core.conda_dependencies import CondaDependencies\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "from allennlp.predictors import Predictor\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Azure ML SDK Version:\", aml.core.VERSION)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "PROJECT_FOLDER = \"./bidaf-question-answering\"\n",
    "SQUAD_FOLDER = \"./squad\"\n",
    "BIDAF_CONFIG_PATH = \".\"\n",
    "LOGS_FOLDER = '.'\n",
    "NUM_EPOCHS = 25\n",
    "PIP_PACKAGES = [\n",
    "        \"allennlp==0.8.4\",\n",
    "        \"azureml-sdk==1.0.48\",\n",
    "        \"https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz\",\n",
    "    ]\n",
    "CONDA_PACKAGES = [\"jsonnet\", \"cmake\", \"regex\", \"pytorch\", \"torchvision\"]\n",
    "config_path = (\n",
    "    \"./.azureml\"\n",
    ")  # Path to the directory containing config.json with azureml credentials\n",
    "\n",
    "# Azure resources\n",
    "subscription_id = \"YOUR_SUBSCRIPTION_ID\"\n",
    "resource_group = \"YOUR_RESOURCE_GROUP_NAME\"  \n",
    "workspace_name = \"YOUR_WORKSPACE_NAME\"  \n",
    "workspace_region = \"YOUR_WORKSPACE_REGION\" #Possible values eastus, eastus2 and so on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. AzureML Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we set up the necessary components for running this as an AzureML experiment\n",
    "1. Create or link to an existing `Workspace`\n",
    "2. Set up an `Experiment` with `logging`\n",
    "3. Create or attach existing `AmlCompute`\n",
    "4. Upload our data to a `Datastore`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Link to or create a Workspace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell looks to set up the connection to your [Azure Machine Learning service Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). You can choose to connect to an existing workspace or create a new one. \n",
    "\n",
    "**To access an existing workspace:**\n",
    "1. If you have a `config.json` file, you do not need to provide the workspace information; you will only need to update the `config_path` variable that is defined above which contains the file.\n",
    "2. Otherwise, you will need to supply the following:\n",
    "    * The name of your workspace\n",
    "    * Your subscription id\n",
    "    * The resource group name\n",
    "\n",
    "**To create a new workspace:**\n",
    "\n",
    "Set the following information:\n",
    "* A name for your workspace\n",
    "* Your subscription id\n",
    "* The resource group name\n",
    "* [Azure region](https://azure.microsoft.com/en-us/global-infrastructure/regions/) to create the workspace in, such as `eastus2`. \n",
    "\n",
    "This will automatically create a new resource group for you in the region provided if a resource group with the name given does not already exist. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Performing interactive authentication. Please follow the instructions on the terminal.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING - Note, we have launched a browser for you to login. For old experience with device code, use \"az login --use-device-code\"\n",
      "WARNING - You have logged in. Now let us find all the subscriptions to which you have access...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Interactive authentication successfully completed.\n"
     ]
    }
   ],
   "source": [
    "ws = azureml_utils.get_or_create_workspace(\n",
    "    config_path=config_path,\n",
    "    subscription_id=subscription_id,\n",
    "    resource_group=resource_group,\n",
    "    workspace_name=workspace_name,\n",
    "    workspace_region=workspace_region,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Workspace name: \" + ws.name,\n",
    "    \"Azure region: \" + ws.location,\n",
    "    \"Subscription id: \" + ws.subscription_id,\n",
    "    \"Resource group: \" + ws.resource_group,\n",
    "    sep=\"\\n\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Set up an Experiment and Logging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we set up an `Experiment` named bidaf-question-answering, add logging capabilities, and create a local folder that will be the source directory for the AzureML run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make a folder for the project\n",
    "os.makedirs(PROJECT_FOLDER, exist_ok=True)\n",
    "\n",
    "# Set up an experiment\n",
    "experiment_name = \"NLP-QA-BiDAF-deepdive\"\n",
    "experiment = Experiment(ws, experiment_name)\n",
    "\n",
    "# Add logging to our experiment\n",
    "run = experiment.start_logging(snapshot_directory=PROJECT_FOLDER)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Link AmlCompute Compute Target\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for explanation of the different options). We will use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target and link to an existing target (if the cluster_name exists) or create a STANDARD_NC6 GPU cluster (autoscales from 0 to 4 nodes) in this example. Creating a new AmlComputes takes approximately 5 minutes. \n",
    "\n",
    "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found existing compute target.\n",
      "{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-23T16:18:34.392000+00:00', 'errors': None, 'creationTime': '2019-07-09T16:20:30.625908+00:00', 'modifiedTime': '2019-07-09T16:20:46.601973+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}\n"
     ]
    }
   ],
   "source": [
    "# choose your cluster\n",
    "cluster_name = \"gpu-test\"\n",
    "\n",
    "try:\n",
    "    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
    "    print(\"Found existing compute target.\")\n",
    "except ComputeTargetException:\n",
    "    print(\"Creating a new compute target...\")\n",
    "    compute_config = AmlCompute.provisioning_configuration(\n",
    "        vm_size=\"STANDARD_NC6\", max_nodes=4\n",
    "    )\n",
    "\n",
    "    # create the cluster\n",
    "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
    "\n",
    "    compute_target.wait_for_completion(show_output=True)\n",
    "\n",
    "# use get_status() to get a detailed status for the current AmlCompute.\n",
    "print(compute_target.get_status().serialize())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Upload Files to Datastore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This step uploads our local files to a `Datastore` so that the data is accessible from the remote compute target. A DataStore is backed either by a Azure File Storage (default option) or Azure Blob Storage ([how to decide between these options](https://docs.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks)) and data is made accessible by mounting or copying data to the compute target. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we download the SQuAD data files and save to a folder called squad."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('./squad/squad_dev.json', <http.client.HTTPMessage at 0x2646892de10>)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "os.makedirs(SQUAD_FOLDER, exist_ok=True)  # make squad folder locally\n",
    "\n",
    "urlretrieve(\n",
    "    \"https://allennlp.s3.amazonaws.com/datasets/squad/squad-train-v1.1.json\",\n",
    "    filename=SQUAD_FOLDER+\"/squad_train.json\",\n",
    ")\n",
    "\n",
    "urlretrieve(\n",
    "    \"https://allennlp.s3.amazonaws.com/datasets/squad/squad-dev-v1.1.json\",\n",
    "    filename=SQUAD_FOLDER+\"/squad_dev.json\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also copy our AllenNLP configuration file (bidaf_config.json) into this squad folder so that it can be uploaded to the `Datastore` and accessed during training. As described in [Section 1.3](#1.3-AllenNLP), this configuration files allows us to easily specify the parameters for instantiating AllenNLP objects. This file contains a dictionary of dictionaries. The top level contains 4 main keys: dataset_reader, model, iterator, and trainer (plus keys for train_data_path, validation_data_path, and evaluate_on_test). If you notice carefully from [Section 1.3](#1.3-AllenNLP), these correspond to the AllenNLP object building blocks. Each of these keys map to a dictionary of parameters. For instance, the trainer dictionary contains keys to specify the number of epochs, learning rate scheduler, optimizer, etc. The parameter settings provided here are the ones suggested by AllenNLP for the BiDAF model; however, below we demonstrate how to override these parameters without having to change this configuration file directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'./squad\\\\bidaf_config.json'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "shutil.copy(BIDAF_CONFIG_PATH+'/bidaf_config.json', SQUAD_FOLDER)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we upload both the SQuAD data files as well as the configuration file to the datastore. `ws.datastores` lists all options for datastores and `ds.account_name` gets the name of the datastore that can be used to find it in the Azure portal. Once we have selected the appropriate datastore, we use the `upload()` method to upload all files from the squad local folder to a folder on the datastore called squad_data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploading an estimated of 3 files\n",
      "Uploading ./squad\\bidaf_config.json\n",
      "Uploading ./squad\\squad_dev.json\n",
      "Uploading ./squad\\squad_train.json\n",
      "Uploaded ./squad\\bidaf_config.json, 1 files out of an estimated total of 3\n",
      "Uploaded ./squad\\squad_dev.json, 2 files out of an estimated total of 3\n",
      "Uploaded ./squad\\squad_train.json, 3 files out of an estimated total of 3\n",
      "Uploaded 3 files\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "$AZUREML_DATAREFERENCE_09a567b57ea546b697d8d7ce1bcf2d86"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select a specific datastore or you can call ws.get_default_datastore()\n",
    "datastore_name = \"workspacefilestore\"\n",
    "ds = ws.datastores[datastore_name]\n",
    "\n",
    "# Upload files in squad data folder to the datastore\n",
    "ds.upload(\n",
    "    src_dir=SQUAD_FOLDER, target_path=\"squad_data\", overwrite=True, show_progress=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Prepare Training Script"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we create a simple training script that uses AllenNLP's `train_model_from_file()` function containing the following parameters:  \n",
    "- parameter_filename (str) : A json parameter file specifying an AllenNLP experiment\n",
    "- serialization_dir (str): The directory in which to save results and logs\n",
    "- overrides (str): A JSON string that we will use to override values in the input parameter file\n",
    "- file_friendly_logging (bool, optional): If True, we make our output more friendly to saved model files\n",
    "- recover (bool, optional): If True, we will try to recover a training run from an existing serialization\n",
    "\n",
    "Our training script parameters are: the location of the data folder, name of the configuration file, and JSON string with any overrides for the configuration file. See the [documentation](https://github.com/allenai/allennlp/blob/9a13ab570025a0c1659986009d2abddb2e652020/allennlp/commands/train.py) on AllenNLP `train_model_from_file()` function for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting ./bidaf-question-answering/train.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile $PROJECT_FOLDER/train.py\n",
    "import torch\n",
    "import argparse\n",
    "import os\n",
    "import shutil\n",
    "from allennlp.common import Params\n",
    "from allennlp.commands.train import train_model_from_file\n",
    "\n",
    "def main():\n",
    "    # get command-line arguments\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument('--data_folder', type=str, \n",
    "                        help='Folder where data is stored')\n",
    "    parser.add_argument('--config_name', type=str, \n",
    "                        help='Name of json configuration file')\n",
    "    parser.add_argument('--overrides', type=str, \n",
    "                        help='Override parameters on config file')\n",
    "    args = parser.parse_args()\n",
    "    squad_folder = os.path.join(args.data_folder, \"squad_data\")\n",
    "    serialization_folder = \"./logs\" #save to the run logs folder\n",
    "    \n",
    "    #delete log file if it already exists\n",
    "    if os.path.isdir(serialization_folder):\n",
    "        shutil.rmtree(serialization_folder)\n",
    "        \n",
    "    train_model_from_file(parameter_filename = os.path.join(squad_folder, args.config_name),\n",
    "           overrides = args.overrides,\n",
    "           serialization_dir = serialization_folder,\n",
    "           file_friendly_logging = True,\n",
    "           recover = False)\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    main()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Create a PyTorch Estimator\n",
    "\n",
    "AllenNLP is built on PyTorch, so we will use the AzureML SDK's PyTorch estimator to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train Pytorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). First we set up a .yml file with the necessary dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'bidafenv.yml'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "myenv = CondaDependencies.create(\n",
    "    conda_packages= CONDA_PACKAGES,\n",
    "    pip_packages= PIP_PACKAGES,\n",
    "    python_version=\"3.6.8\",\n",
    ")\n",
    "myenv.add_channel(\"conda-forge\")\n",
    "myenv.add_channel(\"pytorch\")\n",
    "\n",
    "conda_env_file_name = \"bidafenv.yml\"\n",
    "myenv.save_to_file(PROJECT_FOLDER, conda_env_file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We next define any parameters in the configuration file that we want to override for this specific training run. We demonstrate overriding the num_epochs parameter to perform 25 epochs (rather than 20 epochs as set in bidaf_config.json). The AllenNLP training function expects that overrides are a JSON string, so we convert our dictionary into a JSON string before passing it in as an argument to our training script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "overrides = {\"trainer\":{'num_epochs': NUM_EPOCHS}}\n",
    "overrides = json.dumps(overrides)"
   ]
  },
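  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of such an override can be pictured as a recursive dictionary merge: nested keys present in the override replace the matching entries in the configuration, and everything else is left untouched. The sketch below is an illustration only, not AllenNLP's actual implementation; the helper `apply_overrides` and the heavily trimmed `base_config` (including its optimizer entry) are hypothetical stand-ins.\n",
    "\n",
    "```python\n",
    "import copy\n",
    "import json\n",
    "\n",
    "\n",
    "def apply_overrides(config, overrides):\n",
    "    # Return a copy of config with the nested override values merged in.\n",
    "    merged = copy.deepcopy(config)\n",
    "    for key, value in overrides.items():\n",
    "        if isinstance(value, dict) and isinstance(merged.get(key), dict):\n",
    "            merged[key] = apply_overrides(merged[key], value)\n",
    "        else:\n",
    "            merged[key] = value\n",
    "    return merged\n",
    "\n",
    "\n",
    "# Hypothetical, heavily trimmed stand-in for bidaf_config.json.\n",
    "base_config = {'trainer': {'num_epochs': 20, 'optimizer': {'type': 'adam'}}}\n",
    "\n",
    "# The override travels to the training script as a JSON string.\n",
    "overrides_json = json.dumps({'trainer': {'num_epochs': 25}})\n",
    "\n",
    "effective = apply_overrides(base_config, json.loads(overrides_json))\n",
    "print(effective['trainer'])\n",
    "```\n",
    "\n",
    "Only `num_epochs` changes (from 20 to 25); the sibling optimizer entry is preserved, and the original dictionary is not modified."
   ]
  },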
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the parameters to pass to the training script, the project folder, compute target, conda dependencies file, and the name of the training script. Notice that we set `use_gpu` equal to True. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING - If environment_definition or conda_dependencies_file is specified, Azure ML will not install any framework related packages on behalf of the user.\n",
      "WARNING - framework_version is not specified, defaulting to version 1.1.\n",
      "WARNING - You have specified to install packages in your run. Note that Azure ML also installs the following packages on your behalf: ['torchvision']. \n",
      "This may lead to unexpected package installation errors. Take a look at `estimator.conda_dependencies` to understand what packages are installed by Azure ML.\n"
     ]
    }
   ],
   "source": [
    "script_params = {\n",
    "    \"--data_folder\": ds.as_mount(),\n",
    "    \"--config_name\": \"bidaf_config.json\",\n",
    "    \"--overrides\": overrides,\n",
    "}\n",
    "\n",
    "estimator = PyTorch(\n",
    "    source_directory=PROJECT_FOLDER,\n",
    "    script_params=script_params,\n",
    "    compute_target=compute_target,\n",
    "    entry_script=\"train.py\",\n",
    "    use_gpu=True,\n",
    "    conda_dependencies_file=\"bidafenv.yml\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Submit a Job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Submit the estimator object to run your experiment. Results can be monitored using a Jupyter widget. The widget and run are asynchronous and update every 10-15 seconds until job completion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Run(Experiment: bidaf-question-answering,\n",
      "Id: bidaf-question-answering_1563899344_bce3c688,\n",
      "Type: azureml.scriptrun,\n",
      "Status: Starting)\n"
     ]
    }
   ],
   "source": [
    "run = experiment.submit(estimator)\n",
    "print(run)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3da61f9cf1a84f91ae23925843b584d7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 's…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "RunDetails(run).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "#wait for the run to complete before continuing in the notebook\n",
    "run.wait_for_completion() "
   ]
  },
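  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the widget is unavailable, the run can also be checked programmatically. The following is a minimal sketch, assuming the `run` object returned by `experiment.submit()` above. `get_status()`, `get_metrics()`, and `get_portal_url()` are methods of the AzureML `Run` class; the metrics dictionary contains only values that the training script logged to the run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: query the run without the widget (assumes `run` from the submit cell)\n",
    "print(run.get_status())  # current state of the run, e.g. Running or Completed\n",
    "print(run.get_metrics())  # values logged to the run, if any\n",
    "print(run.get_portal_url())  # link to the run in the Azure portal"
   ]
  },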
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Cancel the Job**\n",
    "\n",
    "Interrupting or restarting the Jupyter kernel does not cancel the run, which can lead to wasted compute resources. To avoid this, we recommend explicitly canceling the run with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment the next line to cancel the run\n",
    "# run.cancel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Inspect Results of Run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AllenNLP's training saves all intermediate and final results to the `serialization_dir` (defined in train.py). To inspect the results and use the trained model, we download these files from the run logs with the run's `download_files()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "run.download_files(prefix=\"./logs\", output_directory=LOGS_FOLDER)"
   ]
  },
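  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To confirm what was downloaded, the local folder can be listed. The following is a minimal sketch, assuming `LOGS_FOLDER` is the local directory defined earlier in this notebook; the exact set of files depends on the training configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Sketch: list every file downloaded under LOGS_FOLDER with its size in bytes\n",
    "for root, _, files in os.walk(LOGS_FOLDER):\n",
    "    for name in sorted(files):\n",
    "        path = os.path.join(root, name)\n",
    "        print(path, os.path.getsize(path))"
   ]
  },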
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.1 Evaluation on SQuAD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `metrics.json` file contains the final metrics. We load this file and extract the best SQuAD dev set exact match (EM) score, stored under the key `best_validation_em` as a fraction between 0 and 1. AllenNLP reports an EM score of 68.3 (0.683 on this scale), so expect a score in that range, depending on the parameters specified in your config file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/scrapbook.scrap.json+json": {
       "data": 0.6152317880794702,
       "encoder": "json",
       "name": "validation_EM",
       "version": 1
      }
     },
     "metadata": {
      "scrapbook": {
       "data": true,
       "display": false,
       "name": "validation_EM"
      }
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.6152317880794702"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(LOGS_FOLDER+\"/logs/metrics.json\") as f:\n",
    "    metrics = json.load(f)\n",
    "\n",
    "sb.glue(\"validation_EM\", metrics[\"best_validation_em\"])\n",
    "metrics[\"best_validation_em\"]"
   ]
  },
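  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same file may hold further validation metrics. The following is a minimal sketch, assuming the `metrics` dictionary loaded above; rather than assuming specific key names, it prints every entry whose key starts with `best_validation_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: show all best-validation entries stored in metrics.json\n",
    "for key, value in sorted(metrics.items()):\n",
    "    if key.startswith(\"best_validation_\"):\n",
    "        print(key, value)"
   ]
  },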
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 Try the Best Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use our model, we create an AllenNLP [Predictor](https://github.com/allenai/allennlp/blob/master/allennlp/predictors/predictor.py) object, which we instantiate from an archive path. An archive comprises a trained Model and its experiment configuration file. After training, the archive is saved to the `serialization_dir` (whose path is set in train.py)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING - _jsonnet not loaded, treating ./logs\\config.json as json\n"
     ]
    }
   ],
   "source": [
    "model = Predictor.from_path(LOGS_FOLDER+\"/logs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Predictor object lets us pass in a question and passage directly (behind the scenes, it converts them to Instance objects using the DatasetReader). We define an example passage and question, call the predictor's `predict()` method, and extract the `best_span_str` entry of the returned dictionary, which contains the answer to our query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "passage = \"Machine Comprehension (MC), answering questions about a given context, \\\n",
    "requires modeling complex interactions between the context and the query. Recently, \\\n",
    "attention mechanisms have been successfully extended to MC. Typically these mechanisms \\\n",
    "use attention to summarize the query and context into a single vector, couple \\\n",
    "attentions temporally, and often form a uni-directional attention. In this paper \\\n",
    "we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage \\\n",
    "hierarchical process that represents the context at different levels of granularity \\\n",
    "and uses a bi-directional attention flow mechanism to achieve a query-aware context \\\n",
    "representation without early summarization. Our experimental evaluations show that \\\n",
    "our model achieves the state-of-the-art results in Stanford QA (SQuAD) and \\\n",
    "CNN/DailyMail Cloze Test datasets.\"\n",
    "\n",
    "question = \"What dataset does BIDAF achieve state-of-the-art results on?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = model.predict(question, passage)[\"best_span_str\"]"
   ]
  },
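  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`predict()` returns a dictionary, of which `best_span_str` is one entry. The following is a minimal sketch for exploring the rest of the output, assuming the `model`, `question`, and `passage` variables defined above; the available keys depend on the AllenNLP version, so the code lists them rather than assuming their names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: inspect the full prediction output instead of a single key\n",
    "output = model.predict(question, passage)\n",
    "print(sorted(output.keys()))\n",
    "print(output[\"best_span_str\"])"
   ]
  },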
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Stanford QA'"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (nlp_gpu)",
   "language": "python",
   "name": "nlp_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
